Extending an on-line parallel corpus management system to handle specific types of structured documents
Abstract
Parallel bilingual or multilingual corpora are often handled as collections of segments without any specific document organization. We describe SECTra_w, a web-oriented system which has been used for online MT evaluations, and has recently been extended to handle multimodal documents such as French-Chinese/Vietnamese/Hindi/Tamil interpreted bilingual spontaneous dialogues, mainly spoken but also using some short texts, and multilingual written articles of an online encyclopedia annotated with UNL graphs.
Similar resources
Comparing k-means clusters on parallel Persian-English corpus
This paper compares clusters of aligned Persian and English texts obtained with the k-means method. Text clustering has many applications in various fields of natural language processing. So far, much research on clustering English documents has been carried out. This raises the question of whether its results extend to other languages. Since the goal of document clustering is grouping of docum...
Bayesian network model for semi-structured document classification
Recently, a new community has started to emerge around the development of new information research methods for searching and analyzing semi-structured and XML-like documents. The goal is to handle both content and structural information, and to deal with different types of information content (text, image, etc.). We consider here the task of structured document classification. We propose a gene...
Domain-Specific Track CLEF 2005: Overview of Results and Approaches, Remarks on the Assessment Analysis
The domain-specific track aims at mono- and cross-language information retrieval on structured scientific data. This track studies retrieval in a domain-specific context using two social science databases: The German Indexing and Retrieval Testdatabase (GIRT) (fourth version GIRT-4: German/English pseudo-parallel corpus with identical documents) with 302,638 documents in total, and the Russian Soc...
The Profile of Patients’ Complaints in a Regional Hospital
Background: A hospital should be an institution that understands and respects the rights of patients, their families, physicians and other caregivers. Hospitals and all other healthcare centers must be careful to respect the ethical aspects of care and treatment. On the other hand, patients’ satisfaction reflects capabilities of physicians and medical staff as well as the extent patients’ rights...
Command-line utilities for managing and exploring annotated corpora
Users of annotated corpora frequently perform basic operations such as inspecting the available annotations, filtering documents, formatting data, and aggregating basic statistics over a corpus. While these may be easily performed over flat text files with stream-processing UNIX tools, similar tools for structured annotation require custom design. Dawborn and Curran (2014) have developed a decl...